
    State of the art document clustering algorithms based on semantic similarity

    The sustained growth of the Internet has hugely increased the number of text documents in electronic form, and techniques for grouping these documents into meaningful clusters have become critical. Traditional clustering methods were based on statistical features, so clustering relied on syntactic rather than semantic notions. As a result, dissimilar documents were often gathered into the same group because of polysemy and synonymy. An important solution to this issue is document clustering based on semantic similarity, in which documents are grouped according to their meaning rather than their keywords. In this research, eighty papers that use semantic similarity in different fields were reviewed; forty of them, which apply semantic similarity to document clustering and were published in the seven years from 2014 to 2020, were selected for in-depth study. A comprehensive literature review of all the selected papers is given, together with a detailed comparison of their clustering algorithms, tools, and evaluation methods. This helps in the implementation and evaluation of document clustering. The reviewed research also informs the direction of the proposed research. Finally, an intensive discussion comparing the works is presented, and the results of our research are shown in figures.

    AFFINITY PROPAGATION AND K-MEANS ALGORITHM FOR DOCUMENT CLUSTERING BASED ON SEMANTIC SIMILARITY

    Clustering text documents is the process of dividing textual material into groups, or clusters. Because the development of Internet technology has produced a large volume of text documents in electronic form, document clustering has gained considerable attention, and data mining methods for grouping these texts into meaningful clusters are becoming critical. Clustering is a branch of data mining: an unsupervised ("blind") process that groups data by similarity, with each group known as a cluster. However, clustering should be based on semantic similarity rather than syntactic notions, which means that documents should be clustered according to their meaning rather than their keywords. This article presents a novel strategy for categorizing articles based on semantic similarity. This is achieved by extracting document descriptions from the IMDB and Wikipedia databases. The vector space is then formed using TF-IDF, and clustering is performed using the Affinity Propagation and K-means methods. The findings are computed and presented on an interactive website.
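
    The TF-IDF and K-means steps named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the smoothed IDF formula, the L2 normalization, the seeding of centroids with the first k documents, and the toy documents are all assumptions made for illustration, and Affinity Propagation is not covered.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized docs."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed weighting: smoothed IDF, so no term gets a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; the first k vectors seed the centroids (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus, already tokenized.
docs = [
    "space rocket launch orbit".split(),
    "football goal match team".split(),
    "rocket orbit satellite space".split(),
    "team match football league".split(),
]
_, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

    Because this representation compares shared terms only, two documents about the same topic that use different vocabulary would not be grouped together, which is the limitation that semantic similarity is meant to address.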

    A survey of exploratory search systems based on LOD resources

    Because the existing Web allows people to share data effortlessly over the Internet, vast amounts of information have accumulated on the Web. A powerful search technology that allows retrieval of relevant information is therefore one of the main requirements for the success of the Web, and this is complicated further by the many different formats used for storing information. Semantic Web technology plays a major role in resolving this problem by permitting search engines to retrieve meaningful information. An exploratory search system, a special information-seeking and exploration approach, supports users who are unfamiliar with a topic, or whose search goals are vague and unfocused, in learning about and investigating a topic through a set of activities. To achieve exploratory search goals, Linked Open Data (LOD) can be used to help search systems retrieve related data, so that the investigation task runs smoothly. This paper provides an overview of Semantic Web technology, Linked Data, and search strategies, followed by a survey of state-of-the-art exploratory search systems based on LOD. Finally, the systems are compared in various aspects such as algorithms, result rankings, and explanations.

    A State of Art Survey for OS Performance Improvement

    With the huge growth of compute-heavy applications that require a high level of performance, interest in monitoring operating system (OS) performance has also grown widely. Over the past several years, as OS performance has become a critical issue, many research studies have investigated and evaluated the stability of OS performance. This paper presents a survey of the most important state-of-the-art approaches and models used for performance measurement and evaluation. Furthermore, the research notes the performance-improvement capabilities of different operating systems using multiple metrics. The selection of metrics used for monitoring performance depends on the monitoring goals and performance requirements. Many previous works related to this subject are addressed, explained in detail, and compared to highlight the most important features, which will be very helpful in selecting the best approach.

    Further Development of BitTorrent Simulator in Erlang

    Among the many P2P file-sharing protocols in existence, BitTorrent is one of the few that has attracted significant attention from a wide range of users. It uses a variety of algorithms for peer selection, piece selection, and other tasks. A simulator that makes it easy to investigate different strategies for implementing the components of a P2P system would be of great advantage. An Erlang-based BitTorrent simulator was developed by the IT department at Uppsala University. The network side of the project had been rewritten in order to improve the functionality of the application. In this thesis work, a new and modular design approach for the client side of the implementation was employed, documented, and incorporated into the application. All nodes run in parallel, and they communicate with each other through the newly developed network module. The implementation supports a variety of options for the BitTorrent simulator. Algorithms with the typical structure can easily be exchanged and used to experiment with new ideas, in order to find out how the swarm is affected by different approaches to implementing BitTorrent clients and trackers. The report also reviews the structure of the previous thesis work and explains the modifications made to the previously developed network module.

    A Semantics-Based Clustering Approach for Online Laboratories Using K-Means and HAC Algorithms

    Due to the availability of a vast amount of unstructured data in various forms (e.g., the web, social networks, etc.), the clustering of text documents has become increasingly important. Traditional clustering algorithms have not been able to solve this problem because, without the semantic relationships between words, they could not accurately represent the meaning of the documents. Thus, semantic document clustering has been extensively utilized to enhance the quality of text clustering. This method is a form of unsupervised learning, and it involves grouping documents based on their meaning, not on common keywords. This paper introduces a new method that groups documents from online laboratory repositories based on the semantic similarity approach. In this work, the dataset is collected first by crawling the short real-time descriptions of the online laboratories’ repositories from the Web. A vector space is created using term frequency-inverse document frequency (TF-IDF), and clustering is done using the K-Means and Hierarchical Agglomerative Clustering (HAC) algorithms with different linkages. Three scenarios are considered: without preprocessing (WoPP); preprocessing with stemming (PPwS); and preprocessing without stemming (PPWoS). Several metrics have been used for evaluating the experiments on five datasets (online labs, 20 NewsGroups, Txt_sentoken, NLTK_Brown and NLTK_Reuters): Silhouette average, purity, V-measure, F1-measure, accuracy score, homogeneity score, completeness and NMI score. Finally, by creating an interactive webpage, the results of the proposed work are contrasted and visualized.
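
    Of the evaluation metrics listed in this abstract, purity is the simplest to state: each cluster is credited with its most frequent true class, and the credited counts are divided by the number of documents. The function below is a minimal sketch of that standard definition, not the paper's code; the function name and the example labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(true_labels, cluster_labels):
    """Fraction of items that belong to the majority true class of their cluster."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")
    # Count the true classes that occur inside each cluster.
    classes_by_cluster = defaultdict(Counter)
    for true, cluster in zip(true_labels, cluster_labels):
        classes_by_cluster[cluster][true] += 1
    # Credit each cluster with the size of its largest true class.
    majority_total = sum(max(c.values()) for c in classes_by_cluster.values())
    return majority_total / len(true_labels)


# Hypothetical labels: two topics, two clusters, one document misplaced in each.
truth = ["lab", "lab", "news", "news", "news", "lab"]
found = [0, 0, 0, 1, 1, 1]
print(purity(truth, found))
```

    Purity rises trivially as the number of clusters grows (one document per cluster scores 1.0), which is one reason to report it alongside measures such as NMI or V-measure, as the abstract does.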

    Towards a Complete Kurdish NLP Pipeline: Challenges and Opportunities

    With the rapid growth of Kurdish-language content on the web, there is a high demand for making this information readable and processable by machines. To accomplish this, a Kurdish Natural Language Processing (KNLP) pipeline is required. Natural Language Processing (NLP) is the field that enables computers to process human language. In its efforts to bridge the communication gap between humans and computers, NLP draws on a wide range of fields, including computer science and computational linguistics. There have been some notable efforts toward creating the KNLP pipeline. However, it does not yet support the complete set of NLP tasks needed to enable semantic web and text mining applications. This paper surveys the work done in the field of NLP for the Kurdish language, its applications, and its linguistic challenges.